{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# SAR Single Node on MovieLens (Python, CPU)\n",
                "\n",
                "In this example, we will walk through each step of the Simple Algorithm for Recommendation (SAR) algorithm using a Python single-node implementation.\n",
                "\n",
                "SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It is powered by understanding the similarity between items, and recommending similar items to those a user has an existing affinity for."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1 SAR algorithm\n",
                "\n",
                "The following figure presents a high-level architecture of SAR. \n",
                "\n",
                "At a very high level, two intermediate matrices are created and used to generate a set of recommendation scores:\n",
                "\n",
                "- An item similarity matrix $S$ estimates item-item relationships.\n",
                "- An affinity matrix $A$ estimates user-item relationships.\n",
                "\n",
                "Recommendation scores are then created by computing the matrix multiplication $A\\times S$.\n",
                "\n",
                "Optional steps (e.g. \"time decay\" and \"remove seen items\") are described in the details below.\n",
                "\n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/images/sar_schema.svg?sanitize=true\">\n",
                "\n",
                "### 1.1 Compute item co-occurrence and item similarity\n",
                "\n",
                "SAR defines similarity based on item-to-item co-occurrence data. Co-occurrence is defined as the number of times two items appear together for a given user. We can represent the co-occurrence of all items as a $m\\times m$ matrix $C$, where $c_{i,j}$ is the number of times item $i$ occurred with item $j$, and $m$ is the total number of items.\n",
                "\n",
                "The co-occurence matric $C$ has the following properties:\n",
                "\n",
                "- It is symmetric, so $c_{i,j} = c_{j,i}$\n",
                "- It is nonnegative: $c_{i,j} \\geq 0$\n",
                "- The occurrences are at least as large as the co-occurrences. I.e., the largest element for each row (and column) is on the main diagonal: $\\forall(i,j) C_{i,i},C_{j,j} \\geq C_{i,j}$.\n",
                "\n",
                "Once we have a co-occurrence matrix, an item similarity matrix $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include `Jaccard`, `lift`, and `counts` (meaning no rescaling).\n",
                "\n",
                "\n",
                "If $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$, the rescaling options are:\n",
                "\n",
                "- `Jaccard`: $s_{ij}=\\frac{c_{ij}}{(c_{ii}+c_{jj}-c_{ij})}$\n",
                "- `lift`: $s_{ij}=\\frac{c_{ij}}{(c_{ii} \\times c_{jj})}$\n",
                "- `counts`: $s_{ij}=c_{ij}$\n",
                "\n",
                "In general, using `counts` as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. `lift` by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. `Jaccard` is a compromise between the two.\n",
                "\n",
                "\n",
                "### 1.2 Compute user affinity scores\n",
                "\n",
                "The affinity matrix in SAR captures the strength of the relationship between each individual user and the items that user has already interacted with. SAR incorporates two factors that can impact users' affinities: \n",
                "\n",
                "- It can consider information about the **type** of user-item interaction through differential weighting of different events (e.g. it may weigh events in which a user rated a particular item more heavily than events in which a user viewed the item).\n",
                "- It can consider information about **when** a user-item event occurred (e.g. it may discount the value of events that take place in the distant past.\n",
                "\n",
                "Formalizing these factors produces us an expression for user-item affinity:\n",
                "\n",
                "$$a_{ij}=\\sum_k w_k \\left(\\frac{1}{2}\\right)^{\\frac{t_0-t_k}{T}} $$\n",
                "\n",
                "where the affinity $a_{ij}$ for user $i$ and item $j$ is the weighted sum of all $k$ events involving user $i$ and item $j$. $w_k$ represents the weight of a particular event, and the power of 2 term reflects the temporally-discounted event. The $(\\frac{1}{2})^n$ scaling factor causes the parameter $T$ to serve as a half-life: events $T$ units before $t_0$ will be given half the weight as those taking place at $t_0$.\n",
                "\n",
                "Repeating this computation for all $n$ users and $m$ items results in an $n\\times m$ matrix $A$. Simplifications of the above expression can be obtained by setting all the weights equal to 1 (effectively ignoring event types), or by setting the half-life parameter $T$ to infinity (ignoring transaction times).\n",
                "\n",
                "### 1.3 Remove seen item\n",
                "\n",
                "Optionally we remove items which have already been seen in the training set, i.e. don't recommend items which have been previously bought by the user again.\n",
                "\n",
                "### 1.4 Top-k item calculation\n",
                "\n",
                "The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix ($A$) by the similarity matrix ($S$). The result is a recommendation score matrix, where each row corresponds to a user, each column corresponds to an item, and each entry corresponds to a user / item pair. Higher scores correspond to more strongly recommended items.\n",
                "\n",
                "It is worth noting that the complexity of recommending operation depends on the data size. SAR algorithm itself has $O(n^3)$ complexity. Therefore the single-node implementation is not supposed to handle large dataset in a scalable manner. Whenever one uses the algorithm, it is recommended to run with sufficiently large memory. "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2 SAR single-node implementation\n",
                "\n",
                "The SAR implementation illustrated in this notebook was developed in Python, primarily with Python packages like `numpy`, `pandas`, and `scipy` which are commonly used in most of the data analytics / machine learning tasks. Details of the implementation can be found in [Recommenders/recommenders/models/sar/sar_singlenode.py](../../recommenders/models/sar/sar_singlenode.py)."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3 SAR single-node based movie recommender"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": [
                "import sys\n",
                "import logging\n",
                "import scipy\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "\n",
                "from recommenders.datasets import movielens\n",
                "from recommenders.datasets.python_splitters import python_stratified_split\n",
                "from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
                "from recommenders.models.sar import SAR\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(f\"System version: {sys.version}\")\n",
                "print(f\"Pandas version: {pd.__version__}\")\n",
                "print(f\"NumPy version: {np.__version__}\")\n",
                "print(f\"SciPy version: {scipy.__version__}\")"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "# Top k items to recommend\n",
                "TOP_K = 10\n",
                "\n",
                "# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
                "MOVIELENS_DATA_SIZE = \"100k\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": [
                "# set log level to INFO\n",
                "logging.basicConfig(\n",
                "    level=logging.DEBUG,\n",
                "    format=\"%(asctime)s %(levelname)-8s %(message)s\",\n",
                "    datefmt=\"%Y-%m-%d %H:%M:%S\",\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 3.1 Load Data\n",
                "\n",
                "SAR is intended to be used on interactions with the following schema:\n",
                "`<User ID>, <Item ID>, <Time>`. \n",
                "\n",
                "Each row represents a single interaction between a user and an item. These interactions might be different types of events on an e-commerce website, such as a user clicking to view an item, adding it to a shopping basket, following a recommendation link, and so on. \n",
                "\n",
                "The MovieLens dataset is well formatted interactions of Users providing Ratings to Movies (movie ratings are used as the event weight) - we will use it for the rest of the example."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [],
            "source": [
                "data = movielens.load_pandas_df(\n",
                "    size=MOVIELENS_DATA_SIZE,\n",
                "    header=[\"UserId\", \"MovieId\", \"Rating\", \"Timestamp\"],\n",
                "    title_col=\"Title\",\n",
                ")\n",
                "\n",
                "# Convert the float precision to 32-bit in order to reduce memory consumption\n",
                "data[\"Rating\"] = data[\"Rating\"].astype(np.float32)\n",
                "\n",
                "data.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 3.2 Split the data using the python random splitter provided in utilities:\n",
                "\n",
                "We split the full dataset into a `train` and `test` dataset to evaluate performance of the algorithm against a held-out set not seen during training. Because SAR generates recommendations based on user preferences, all users that are in the test set must also exist in the training set. For this case, we can use the provided `python_stratified_split` function which holds out a percentage (in this case 25%) of items from each user, but ensures all users are in both `train` and `test` datasets. Other options are available in the `dataset.python_splitters` module which provide more control over how the split occurs.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [],
            "source": [
                "header = {\n",
                "    \"col_user\": \"UserId\",\n",
                "    \"col_item\": \"MovieId\",\n",
                "    \"col_rating\": \"Rating\",\n",
                "    \"col_timestamp\": \"Timestamp\",\n",
                "    \"col_prediction\": \"Prediction\",\n",
                "}"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "train, test = python_stratified_split(\n",
                "    data, ratio=0.75, col_user=header[\"col_user\"], col_item=header[\"col_item\"], seed=42\n",
                ")\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "In this case, for the illustration purpose, the following parameter values are used:\n",
                "\n",
                "|Parameter|Value|Description|\n",
                "|---------|---------|-------------|\n",
                "|`similarity_type`|`jaccard`|Method used to calculate item similarity.|\n",
                "|`time_decay_coefficient`|30|Period in days (term of $T$ shown in the formula of Section 1.2)|\n",
                "|`time_now`|`None`|Time decay reference.|\n",
                "|`timedecay_formula`|`True`|Whether time decay formula is used.|"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 16,
            "metadata": {},
            "outputs": [],
            "source": [
                "model = SAR(\n",
                "    similarity_type=\"jaccard\", \n",
                "    time_decay_coefficient=30, \n",
                "    time_now=None, \n",
                "    timedecay_formula=True, \n",
                "    **header\n",
                ")"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 17,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "2023-07-04 09:49:54 INFO     Collecting user affinity matrix\n",
                        "2023-07-04 09:49:54 INFO     Calculating time-decayed affinities\n",
                        "2023-07-04 09:49:54 INFO     Creating index columns\n",
                        "2023-07-04 09:49:54 INFO     Building user affinity sparse matrix\n",
                        "2023-07-04 09:49:54 INFO     Calculating item co-occurrence\n",
                        "2023-07-04 09:49:55 INFO     Calculating item similarity\n",
                        "2023-07-04 09:49:55 INFO     Using jaccard based similarity\n",
                        "2023-07-04 09:49:55 INFO     Done training\n"
                    ]
                }
            ],
            "source": [
                "model.fit(train)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 18,
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "2023-07-04 09:49:57 INFO     Calculating recommendation scores\n",
                        "2023-07-04 09:49:57 INFO     Removing seen items\n"
                    ]
                }
            ],
            "source": [
                "top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The final output from the `recommend_k_items` method generates recommendation scores for each user-item pair, which are shown as follows."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 19,
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>UserId</th>\n",
                            "      <th>MovieId</th>\n",
                            "      <th>Prediction</th>\n",
                            "      <th>Title</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>9420</th>\n",
                            "      <td>943</td>\n",
                            "      <td>176</td>\n",
                            "      <td>21.325644</td>\n",
                            "      <td>Aliens (1986)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9421</th>\n",
                            "      <td>943</td>\n",
                            "      <td>89</td>\n",
                            "      <td>20.901408</td>\n",
                            "      <td>Blade Runner (1982)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9422</th>\n",
                            "      <td>943</td>\n",
                            "      <td>82</td>\n",
                            "      <td>20.688100</td>\n",
                            "      <td>Jurassic Park (1993)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9423</th>\n",
                            "      <td>943</td>\n",
                            "      <td>172</td>\n",
                            "      <td>20.287318</td>\n",
                            "      <td>Empire Strikes Back, The (1980)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9424</th>\n",
                            "      <td>943</td>\n",
                            "      <td>423</td>\n",
                            "      <td>20.256682</td>\n",
                            "      <td>E.T. the Extra-Terrestrial (1982)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9425</th>\n",
                            "      <td>943</td>\n",
                            "      <td>195</td>\n",
                            "      <td>20.250996</td>\n",
                            "      <td>Terminator, The (1984)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9426</th>\n",
                            "      <td>943</td>\n",
                            "      <td>202</td>\n",
                            "      <td>20.145059</td>\n",
                            "      <td>Groundhog Day (1993)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9427</th>\n",
                            "      <td>943</td>\n",
                            "      <td>68</td>\n",
                            "      <td>19.983884</td>\n",
                            "      <td>Crow, The (1994)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9428</th>\n",
                            "      <td>943</td>\n",
                            "      <td>566</td>\n",
                            "      <td>19.820856</td>\n",
                            "      <td>Clear and Present Danger (1994)</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>9429</th>\n",
                            "      <td>943</td>\n",
                            "      <td>550</td>\n",
                            "      <td>19.804157</td>\n",
                            "      <td>Die Hard: With a Vengeance (1995)</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "      UserId  MovieId  Prediction                              Title\n",
                            "9420     943      176   21.325644                      Aliens (1986)\n",
                            "9421     943       89   20.901408                Blade Runner (1982)\n",
                            "9422     943       82   20.688100               Jurassic Park (1993)\n",
                            "9423     943      172   20.287318    Empire Strikes Back, The (1980)\n",
                            "9424     943      423   20.256682  E.T. the Extra-Terrestrial (1982)\n",
                            "9425     943      195   20.250996             Terminator, The (1984)\n",
                            "9426     943      202   20.145059               Groundhog Day (1993)\n",
                            "9427     943       68   19.983884                   Crow, The (1994)\n",
                            "9428     943      566   19.820856    Clear and Present Danger (1994)\n",
                            "9429     943      550   19.804157  Die Hard: With a Vengeance (1995)"
                        ]
                    },
                    "execution_count": 19,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "top_k_with_titles = top_k.join(\n",
                "    data[[\"MovieId\", \"Title\"]].drop_duplicates().set_index(\"MovieId\"),\n",
                "    on=\"MovieId\",\n",
                "    how=\"inner\",\n",
                ").sort_values(by=[\"UserId\", \"Prediction\"], ascending=False)\n",
                "\n",
                "top_k_with_titles.head(10)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 3.3 Evaluate the results\n",
                "\n",
                "It should be known that the recommendation scores generated by multiplying the item similarity matrix $S$ and the user affinity matrix $A$ **DOES NOT** have the same scale with the original explicit ratings in the movielens dataset. That is to say, SAR algorithm is meant for the task of *recommending relevent items to users* rather than *predicting explicit ratings for user-item pairs*. \n",
                "\n",
                "To this end, ranking metrics like precision@k, recall@k, etc., are more applicable to evaluate SAR algorithm. The following illustrates how to evaluate SAR model by using the evaluation functions provided in Recommenders library."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 20,
            "metadata": {},
            "outputs": [],
            "source": [
                "# all ranking metrics have the same arguments\n",
                "args = [test, top_k]\n",
                "kwargs = dict(\n",
                "    col_user=\"UserId\",\n",
                "    col_item=\"MovieId\",\n",
                "    col_rating=\"Rating\",\n",
                "    col_prediction=\"Prediction\",\n",
                "    relevancy_method=\"top_k\",\n",
                "    k=TOP_K,\n",
                ")\n",
                "\n",
                "eval_map = map_at_k(*args, **kwargs)\n",
                "eval_ndcg = ndcg_at_k(*args, **kwargs)\n",
                "eval_precision = precision_at_k(*args, **kwargs)\n",
                "eval_recall = recall_at_k(*args, **kwargs)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 21,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Model:\n",
                        "Top K:\t\t 10\n",
                        "MAP:\t\t 0.113796\n",
                        "NDCG:\t\t 0.384809\n",
                        "Precision@K:\t 0.331707\n",
                        "Recall@K:\t 0.182571\n"
                    ]
                }
            ],
            "source": [
                "print(f\"Model:\",\n",
                "      f\"Top K:\\t\\t {TOP_K}\",\n",
                "      f\"MAP:\\t\\t {eval_map:f}\",\n",
                "      f\"NDCG:\\t\\t {eval_ndcg:f}\",\n",
                "      f\"Precision@K:\\t {eval_precision:f}\",\n",
                "      f\"Recall@K:\\t {eval_recall:f}\", sep='\\n')"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Record results for tests - ignore this cell\n",
                "store_metadata(\"map\", eval_map)\n",
                "store_metadata(\"ndcg\", eval_ndcg)\n",
                "store_metadata(\"precision\", eval_precision)\n",
                "store_metadata(\"recall\", eval_recall)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## References\n",
                "Note SAR is a combinational algorithm that implements different industry heuristics. The followings are references that may be helpful in understanding the SAR logic and implementation. \n",
                "\n",
                "1. Badrul Sarwar, *et al*, \"Item-based collaborative filtering recommendation algorithms\", WWW, 2001.\n",
                "2. Scipy (sparse matrix), url: https://docs.scipy.org/doc/scipy/reference/sparse.html\n",
                "3. Asela Gunawardana and Guy Shani, \"A survey of accuracy evaluation metrics of recommendation tasks\", The Journal of Machine Learning Research, vol. 10, pp 2935-2962, 2009.\t"
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "kernelspec": {
            "display_name": "Python 3 (ipykernel)",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.9.16"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
